ABSTRACT



Several person-fit statistics have been proposed to detect
item score patterns that do not fit an item response theory model. To
classify response patterns as not fitting a model, a distribution of a
person-fit statistic is needed. The null distributions of several fit
statistics have been investigated using conventionally administered tests,
but less is known about the distribution of fit statistics for computerized
adaptive testing (CAT). A three-part simulation to study this distribution is
described. First, the theoretical distribution of the often used l(z)
statistic across theta levels in conventional testing and in CAT
was studied, where theta and estimated theta were used to calculate l(z).
Also, the distribution of a statistic l*(z), which is corrected for the error
in estimated theta and was proposed by T. Snijders (1998), was studied in both
testing environments. Second, simulating the distribution of l(z) for the
two-parameter logistic model for conventional tests was studied. Two procedures
for simulating the distribution of l(z) and l*(z) in a CAT were examined: (1)
item scores were simulated with a fixed set of administered items; and (2)
item scores were generated according to a stochastic design, where the choice
of the administered item i + 1 depended on responses to previously
administered items. The third study was a power study conducted to compare
detection rates of l*(z) with l(z) for conventional tests. Results indicate
that the distribution of l(z) differed from the theoretical distribution in
conventional and CAT environments. In a conventional testing situation, the
distribution of l*(z) was in accord with the theoretical distribution, but for
the CAT the distribution differed from the theoretical distribution. In the
context of conventional testing, simulating the sampling distribution of l(z)
for every examinee, based on theta, resulted in an appropriate approximation
of the distribution. However, for the CAT environment, simulating the
sampling distributions of both l(z) and l*(z) was problematic. Two appendixes
show the derivation of the l*(z) statistic and discuss modeling local
dependence. (Contains 6 tables, 3 figures, and 24 references.) (Author/SLD)
Abstract



Several person-fit statistics have been proposed to detect item score patterns that do
not fit an item response theory model. To classify response patterns as not fitting a model,
a distribution of a person-fit statistic is needed. Recently, the null distributions of several
fit statistics have been investigated using conventionally administered tests. For computerized
adaptive testing (CAT), however, less is known about the distribution of fit statistics. In this
study a three-part simulation study was conducted. First, the theoretical distribution of the often
used $l_z$-statistic across $\theta$-levels in a conventional testing and CAT environment was
investigated, where $\theta$ and $\hat{\theta}$ were used to calculate $l_z$. Also, the distribution
of a statistic $l_z^*$, which is corrected for the error in $\hat{\theta}$ and was proposed by
Snijders (1998), was investigated in a conventional testing and CAT environment. Second, simulating
the distribution of $l_z$ for the 2PLM for conventionally administered tests was investigated. Two
procedures for simulating the distribution of $l_z$ and $l_z^*$ in a CAT were examined: (1) item
scores were simulated with a fixed set of administered items, and (2) item scores were generated
according to a stochastic design, where the choice of the administered item $i + 1$ depended on the
responses to previously administered items. Third, a power study was conducted to compare the
detection rates of $l_z^*$ with $l_z$ for conventional tests. Results indicated that the
distribution of $l_z$ differed from the theoretical distribution in a conventional and CAT
environment. In a conventional testing situation, the distribution of $l_z^*$ was in concordance
with the theoretical distribution. However, for a CAT the distribution differed from the
theoretical distribution. In the context of conventional testing, simulating the sampling
distribution of $l_z$ for every examinee, based on $\theta$, resulted in an appropriate
approximation of the distribution. However, in a CAT environment, simulating the sampling
distributions of both $l_z$ and $l_z^*$ was problematic.
Simulating the Null Distribution



Simulating the Null Distribution of Person-Fit Statistics
for Conventional and Adaptive Tests

Item responses that do not fit the assumed item response theory (IRT) model may cause
the latent trait value, $\theta$, to be inaccurately estimated. Possible explanations for nonfitting
test behavior include test anxiety, guessing, cheating on achievement tests, or response distortion
as a result of faking the answers on personality inventories (Zickar & Drasgow, 1996). Person-fit
statistics have been proposed to detect nonfitting score patterns (e.g., Drasgow & Levine,
1986; Meijer, 1994; Tatsuoka, 1984), and the effectiveness of these statistics in detecting
nonfitting response vectors has been investigated (e.g., Drasgow, Levine, & McLaughlin, 1987, 1991).
However, most person-fit studies concentrated on conventionally administered tests, or
paper-and-pencil (P&P) tests. With the increasing use of computerized adaptive tests (CAT),
additional research is needed on the application of person-fit statistics to these types of
tests.

A few studies investigated the usefulness of person-fit analysis in CAT. Candell (1988;
cited in Drasgow, Levine, & Zickar, 1996) studied the ability of a likelihood ratio test to
identify simulated aberrant examinees. He used optimal person-fit statistics, in which the
likelihood of a response pattern under normal responding is compared with its likelihood under an
aberrant model. Although this approach is interesting because it has maximum power against a
specified alternative, the drawback is that the power is low for other types of aberrant responding.

Nering (1997) examined the distribution of two fit statistics in a CAT environment: $l_z$
(Drasgow, Levine, & Williams, 1985), a standardized version of the log-likelihood statistic $l_0$
proposed by Levine and Rubin (1979), and $ECI4_z$ (Tatsuoka, 1984). Nering found that the
empirical distribution was dramatically different from the theoretical distribution. As a result,
Nering concluded that "when attempting to classify a response vector as model divergent (...)
cutscores may have to be based on such factors as item pool size, item pool discrimination, and
so forth". An alternative to using a critical value derived from a theoretical distribution is to
simulate for each examinee a distribution of a person-fit statistic based on the characteristics
of the item bank and the estimated latent trait value of $\theta$, denoted as $\hat{\theta}$. Using
the simulated distribution it can be determined how likely a response vector is under the IRT model.

Snijders (1998) proposed an alternative standardization of the $l_z$-statistic for the case in
which $\theta$ is replaced by $\hat{\theta}$. This statistic is denoted here as $l_z^*$. In a small
simulation study Snijders (1998) investigated the distribution of $l_z^*$ in a conventional testing
environment and found that the empirical distribution was close to the theoretical distribution.



The purpose of the present study was to extend the Nering study and the Snijders study
by examining the distribution of $l_z$ and $l_z^*$ in a conventional testing and CAT environment
and to investigate two different ways to simulate the distribution of $l_0$, $l_z$, and $l_z^*$. In
addition, the detection rate of $l_z$ and $l_z^*$ for nonfitting score patterns in conventional
tests was investigated.



Person-Fit Analysis 



In person-fit analysis, several fit statistics have been used in the context of the one-,
two-, and three-parameter logistic model (1-, 2-, 3PLM) (Hambleton & Swaminathan, 1985, pp.
35-48). In this study we use the 2PLM because it is less restrictive with respect to empirical
data than the one-parameter logistic model and it does not have the estimation problems of the
guessing parameter in the three-parameter logistic model (e.g., Baker, 1992, pp. 109-112). The
2PLM has been shown to fit several achievement and personality data sets reasonably well (e.g.,
Reise & Waller, 1990; Zickar & Drasgow, 1996).

Let $X_i$ be the binary (0, 1) response to item $i$, where 1 denotes a correct or keyed
response, and 0 denotes an incorrect or not keyed response. Further, let $a_i$ denote the item
discrimination parameter and $b_i$ the item difficulty parameter; then the probability of correctly
answering an item according to the 2PLM can be written as



$$P_i(\theta) = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}. \tag{1}$$



Levine and Rubin (1979) proposed the log-likelihood statistic, denoted as $l_0$, as a measure of
departure from the logistic IRT models. For a test of $k$ items, $l_0$ can be written as



$$l_0 = \ln \prod_{i=1}^{k} P_i(\theta)^{X_i} \left[1 - P_i(\theta)\right]^{1 - X_i}. \tag{2}$$



Because $l_0$ is confounded with $\theta$, Drasgow, Levine, and Williams (1985) proposed to use the
standardized version of $l_0$, denoted as $l_z$. This statistic equals



$$l_z = \frac{l_0 - E(l_0)}{\left[\mathrm{var}(l_0)\right]^{1/2}}, \tag{3}$$



where $E(l_0)$ and $\mathrm{var}(l_0)$ denote the expectation and variance of $l_0$, respectively.
These quantities are given by



$$E(l_0) = \sum_{i=1}^{k} \left\{ P_i(\theta) \ln\left[P_i(\theta)\right] + \left(1 - P_i(\theta)\right) \ln\left[1 - P_i(\theta)\right] \right\} \tag{4}$$



and 



$$\mathrm{var}(l_0) = \sum_{i=1}^{k} P_i(\theta)\left(1 - P_i(\theta)\right) \left[ \ln \frac{P_i(\theta)}{1 - P_i(\theta)} \right]^2. \tag{5}$$
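
As an illustration of Equations 1 to 5, the following minimal Python sketch computes $l_z$ for a single response vector. The function names, the five-item test, and the two response patterns are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def p_2plm(theta, a, b):
    """Probability of a keyed response under the 2PLM (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def l_z(responses, theta, a_params, b_params):
    """Standardized log-likelihood statistic l_z (Equations 2 to 5)."""
    l0 = e_l0 = var_l0 = 0.0
    for x, a, b in zip(responses, a_params, b_params):
        p = p_2plm(theta, a, b)
        q = 1.0 - p
        l0 += x * math.log(p) + (1 - x) * math.log(q)   # Equation 2
        e_l0 += p * math.log(p) + q * math.log(q)        # Equation 4
        var_l0 += p * q * math.log(p / q) ** 2           # Equation 5
    return (l0 - e_l0) / math.sqrt(var_l0)               # Equation 3

# Hypothetical five-item test, evaluated at theta = 0.
a_params = [1.0, 1.2, 0.8, 1.1, 0.9]
b_params = [-2.0, -1.0, 0.0, 1.0, 2.0]
fitting = [1, 1, 1, 0, 0]    # easy items correct, hard items incorrect
aberrant = [0, 0, 1, 1, 1]   # easy items incorrect, hard items correct

print(l_z(fitting, 0.0, a_params, b_params))
print(l_z(aberrant, 0.0, a_params, b_params))
```

The aberrant pattern yields a large negative value, which is the direction in which $l_z$ signals misfit.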



For classifying a response pattern as aberrant, an important tool is the probability of
exceedance, or significance probability. Because large negative values of $l_z$ indicate aberrance,
the significance probabilities in the left tail of the distribution are of interest. Let $t$ be the
observed value of the person-fit statistic $T$. Then, the significance probability is defined as
the probability under the sampling distribution that the value of the test statistic is smaller than
the observed value of the statistic: $p^* = P(T < t)$. The value of a statistic with $p^* = \alpha$
will be denoted as the critical value at significance level $\alpha$. For example, for a standard
normally distributed statistic the critical value at level $\alpha = 0.05$ is $-1.65$.
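
When the sampling distribution is available only as a set of simulated values, both quantities can be estimated empirically. The sketch below is one simple way to do this, assuming the critical value is taken as an order statistic of the simulated values; the function names are hypothetical, and the normal draws merely stand in for simulated values of a person-fit statistic.

```python
import math
import random

def significance_probability(simulated, observed):
    """Left-tail p* = P(T < t), estimated as a proportion of simulated values."""
    return sum(1 for value in simulated if value < observed) / len(simulated)

def critical_value(simulated, alpha):
    """Empirical alpha-quantile (left tail) of the simulated values."""
    ordered = sorted(simulated)
    # The small tolerance guards against floating-point error in alpha * n.
    index = max(0, math.ceil(alpha * len(ordered) - 1e-9) - 1)
    return ordered[index]

# Stand-in for a simulated null distribution: standard normal draws.
rng = random.Random(0)
simulated = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(critical_value(simulated, 0.05))            # close to -1.65 for normal draws
print(significance_probability(simulated, -2.0))  # close to 0.023 for normal draws
```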

Drasgow et al. (1985) claimed that, in the context of conventional or paper-and-pencil (P&P)
testing, $l_z$ follows a standard normal distribution for long tests (say, tests longer than
80 items). However, several studies (e.g., Molenaar & Hoijtink, 1990; Meijer & Nering, 1997)
showed that $l_z$ was not standard normally distributed for tests of realistic length (20 to 60
items). It was found that the distribution of $l_z$ was negatively skewed and that the normal
approximation was inaccurate, especially in the tails of the distribution. As an alternative,
Molenaar and Hoijtink (1990) proposed for the Rasch model three approximations to the distribution
of $l_0$ conditional on the total score: using (1) complete enumeration, (2) Monte Carlo simulation,
and (3) a chi-square distribution, where the mean, standard deviation, and skewness of $l_0$ were
taken into account. Complete enumeration is suitable for very short tests. For tests of moderate
length, a chi-square distribution was proposed for $l_0$, conditional on the total score. For very
long tests, an accurate calculation of the moments, needed for the chi-square approximation,
is difficult, and as an alternative Monte Carlo simulation was applied. In the Rasch model the
total score is a sufficient statistic for $\theta$; for the 2PLM this is not the case, that is, the
distribution of a person-fit statistic conditional on the total score depends on $\theta$. As an
alternative, $\hat{\theta}$ can be used in the case of the 2PLM or 3PLM. However, care should be
taken in doing this because, as Molenaar and Hoijtink (1990) noticed, the statistical results on the
distribution of a person-fit statistic may change when substituting $\hat{\theta}$ for $\theta$.

Snijders (1998; see also Molenaar & Hoijtink, 1990) showed that, when the true
person parameter is replaced by an estimate, the variance of the person-fit statistic decreases.
When this decreased variance is not taken into account, this will lead to a conservative
classification of nonfitting response patterns. Snijders derived the asymptotic distribution for
several person-fit statistics in which $\theta$ was replaced by $\hat{\theta}$. He showed that the
asymptotic distribution of



$$l_z^* = \frac{l_0 - E(l_0) + c_n(\hat{\theta})\, r_0(\hat{\theta})}{\sqrt{n}\, \tau_n(\hat{\theta})} \tag{6}$$



is standard normal, where, for the 2PLM and for the weighted maximum likelihood estimator 
(Warm, 1989) 
[Equations 7 to 9, which define the quantities appearing in Equation 6, are not legible in the source text.]



where $P_i'(\theta) = dP_i(\theta)/d\theta$. See Appendix A for more details. Snijders performed a
simulation study for relatively short conventional tests of 8 and 15 items, fitting the 2PLM and
using maximum likelihood estimation. The results showed that the approximation was satisfactory for
$\alpha = 0.05$ and $\alpha = 0.10$, but that it was too liberal for smaller values of $\alpha$.



Person Fit in CAT 

Nering (1996, 1997) examined the distribution of $l_z$ within CAT by evaluating the
first four moments (mean, standard deviation, skewness, and kurtosis) of the distribution. His
results were in concordance with the results using conventional tests: the exact distribution of
$l_z$ was different from the standard normal distribution. An important finding was that the normal
approximation in the tails of the distribution was inaccurate: using a critical value of $-1.65$
resulted in a conservative classification of aberrant response patterns, that is,
$p^* = P(l_z < -1.65) < 0.05$. On the basis of these results it can be concluded that the standard
normal distribution is not useful for obtaining a critical value in CAT. A possible solution for
determining significance probabilities in CAT may be to approximate the distribution of a statistic
by using simulation methods.

Simulating Distributions of Person-Fit Statistics in CAT 

The significance probabilities can be determined by simulating the distribution of a
person-fit statistic $T$, for example $l_z$ or $l_z^*$. This can be realized in at least two ways.
One possibility is to determine the distribution of $T$ by drawing a large number of
$\theta$-values from the standard normal distribution, each $\theta$-value representing a person
responding to the test. Then, item scores are simulated for each $\theta$-value, according to the
assumed IRT model and the CAT procedure, and the value of $T$ for each pattern is calculated; the
$T$-values of these patterns constitute the simulated distribution based on the characteristics of
the item bank. Another possibility is to simulate a distribution of $T$ for each $\theta$-value;
that is, given $\theta$, a large number of response patterns are simulated and for each pattern the
$T$-value is calculated. So, now a distribution is simulated based on the item bank and $\theta$.
Based on this distribution a critical value at level $\alpha$ can be determined.

The first method results in using one critical value for all simulees, whereas the second
method will probably result in using different critical values at different $\theta$-values. When
the distribution of $T$ is the same across all $\theta$-levels, the first method will result in an
appropriate simulated distribution.
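
The contrast between the two methods can be sketched in Python for a fixed (conventional) test. This is a simplified illustration under stated assumptions, not the procedure used in the study: $l_z$ is computed at the true $\theta$ rather than at an estimate, the 20-item test is hypothetical, and the value 0.2 in the discrimination distribution is treated as a standard deviation.

```python
import math
import random

def p_2plm(theta, a, b):
    """Probability of a keyed response under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_l_z(theta, items, rng):
    """Simulate one response pattern at theta and return its l_z (true theta used)."""
    l0 = e_l0 = var_l0 = 0.0
    for a, b in items:
        p = p_2plm(theta, a, b)
        q = 1.0 - p
        x = 1 if rng.random() < p else 0
        l0 += math.log(p) if x else math.log(q)
        e_l0 += p * math.log(p) + q * math.log(q)
        var_l0 += p * q * math.log(p / q) ** 2
    return (l0 - e_l0) / math.sqrt(var_l0)

def critical_value(values, alpha):
    """Empirical alpha-quantile (left tail) of the simulated values."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(alpha * len(ordered) - 1e-9) - 1)]

rng = random.Random(1)
# Hypothetical 20-item test; discriminations are kept positive.
items = [(max(0.1, rng.gauss(1.0, 0.2)), rng.uniform(-3.0, 3.0)) for _ in range(20)]

# Method 1: one distribution pooled over theta ~ N(0, 1).
pooled = [simulate_l_z(rng.gauss(0.0, 1.0), items, rng) for _ in range(5000)]

# Method 2: a separate distribution at each fixed theta level.
per_theta = {t: [simulate_l_z(t, items, rng) for _ in range(5000)]
             for t in (-2.0, 0.0, 2.0)}

print("pooled critical value:", round(critical_value(pooled, 0.05), 2))
for t, values in per_theta.items():
    print("theta =", t, "critical value:", round(critical_value(values, 0.05), 2))
```

If the printed critical values differ noticeably across the fixed levels, a single pooled critical value is not appropriate for all simulees.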

Using the second method, the distribution of $T$ can be simulated using a fixed sequence
of items (test design), in which for each $\theta$-value a large number of response vectors are
simulated given the observed test design. Thus, each response vector consists of responses to the
same items. However, an important aspect when simulating the distribution of a statistic is the
stochastic process of item selection in a CAT (Glas, Meijer, & van Krimpen, 1997): in a CAT,
the test design may be different for each simulee. To take this stochastic nature into account,
the distribution can be simulated using a stochastic test design, in which for each $\theta$-value
a large number of adaptive response patterns are simulated with, in principle, different test
designs.

Let the vector $d$ denote the test design, that is, a vector of the item numbers of the items
administered in a CAT, and let $T(X)$ be a statistic of the observed response vector
$X = (X_1, \ldots, X_k)$ of a test with $k$ items. In a CAT, item selection is based on responses to
previously administered items, which depend on the ability of the examinee. Therefore, $X$ is
conditional on $d$ and $\theta$, and a function of $X$, for example the statistic $T(X) = T$, is also
conditional on $d$ and $\theta$. Thus, the distribution of $T$, conditional on $d$ and $\theta$, is
defined as

$$f(T \mid d, \theta). \tag{10}$$

To obtain the unconditional distribution of $T$ at a fixed $\theta$-level, Equation 10 can be
multiplied by the probability distribution of the design $d$, which results in

$$f(T, d \mid \theta) = f(d \mid \theta)\, f(T \mid d, \theta). \tag{11}$$

Summing Equation 11 over all possible designs $d$ then gives the distribution of $T$ at that
$\theta$-level.

Comparing the values of the statistic across examinees with the same $\theta$ is difficult
in a CAT environment, because in principle examinees respond to different tests. However,
comparing significance probabilities of the observed value of the statistic across examinees is
possible. For determining the significance probability, the distribution of a statistic, conditional
on $\theta$, can be simulated for a fixed test design (Equation 10) or a stochastic test design
(Equation 11). In both approaches the distribution can be approximated by replicating the test
$n$ times.

Purpose of the Study 

This study was designed to investigate (1) the distribution of $l_z$ and $l_z^*$ across different
$\theta$-levels and the influence of estimation errors of $\theta$ on the distribution of $l_z$ and
$l_z^*$ for conventional testing (P&P) and in CAT (Study 1), (2) the influence of estimation errors
of $\theta$ on simulating the distribution of $l_0$ and $l_z$ for conventional tests and CAT, and the
influence of the stochastic nature of the test design in CAT (Study 2), and (3) the detection rate
of $l_z$ and $l_z^*$ for several types of aberrant response behavior in a conventional testing
situation (Study 3). This study thus extends both the Nering (1997) and the Snijders (1998) study.

Study 1 

In this study the distributions of $l_z$ and $l_z^*$ were investigated in a conventional and CAT
situation. Nering (1997) examined the distribution of $l_z$ in a CAT environment by first drawing
10,000 $\theta$-values from the standard normal distribution and then simulating adaptive response
vectors for each $\theta$-value. For each response vector, $\hat{\theta}$ was used to determine the
value of $l_z$. These 10,000 values constituted the simulated distribution of $l_z$. In this study,
the simulated distribution was determined by (1) drawing true $\theta$ from a standard normal
distribution, or (2) fixing true $\theta$ at different levels. In both (1) and (2), response vectors
were generated and $\theta$ was estimated by $\hat{\theta}$; $\hat{\theta}$ was used to determine the
value of $l_z$ and $l_z^*$. Doing so, the critical values obtained by Nering can be compared with the
critical values obtained when the distribution of $l_z$ is simulated at a fixed $\theta$-level.
Finally, $l_z$ was also calculated using true $\theta$, that is, $P_i(\theta)$ was used to determine
the value of $l_z$. This enables us to investigate the influence of estimation errors in
$\hat{\theta}$ on the distribution of $l_z$. Note that $l_z^*$ is only an appropriate standardization
when $\hat{\theta}$ is used.

Method 

P&P. Tests of 20, 50, and 80 items fitting the 2PLM were constructed, with
$a_i \sim N(1; 0.2)$ and $b_i \sim U(-3; 3)$; each test was fixed for all simulees. For each test,
ten datasets of 10,000 response vectors each were constructed. Nine datasets were constructed at
nine different $\theta$-levels: $\theta = -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5$, and $2$; one dataset
was constructed where 10,000 $\theta$'s were drawn from a standard normal distribution.

First, for each response vector the values of $l_z$ and $l_z^*$ were calculated using
$\hat{\theta}$, and these 10,000 values of $l_z$ and $l_z^*$ were used to obtain the distribution of
$l_z$ and $l_z^*$ for each dataset; $\theta$ was estimated using weighted maximum likelihood
estimation (Warm, 1989); this estimator is less biased than the maximum likelihood estimator, and
also exists for patterns with only 1-scores or only 0-scores. For all simulated distributions the
critical values at level $\alpha$ were determined and compared with the critical values at level
$\alpha$ of the standard normal distribution, where $\alpha = 0.01, 0.02, 0.03, 0.04$, and $0.05$.
Furthermore, the first four moments (mean, standard deviation, skewness, and kurtosis) of the
simulated distribution of $l_z$ and $l_z^*$ were computed and compared with the moments of the
standard normal distribution. Second, $l_z$ was also calculated using true $\theta$ for each response
vector to constitute the distribution of $l_z$ without the presence of estimation errors.
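
The moment summaries can be sketched as follows. This minimal example assumes population moments (division by the number of values) and reports kurtosis as excess kurtosis, so that a normal distribution has skewness 0 and kurtosis 0; the function name is hypothetical.

```python
def first_four_moments(values):
    """Return mean, variance, skewness, and excess kurtosis of a list of values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0                   # excess kurtosis
    return mean, m2, skewness, kurtosis

# Small symmetric example: the skewness is zero by construction.
print(first_four_moments([-2.0, -1.0, 0.0, 1.0, 2.0]))
```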

CAT. Ten datasets of 10,000 adaptive response patterns each were constructed.
Nine datasets were constructed at nine different $\theta$-levels: $\theta = -2, -1.5, -1, -0.5, 0,
0.5, 1, 1.5$, and $2$; one dataset was constructed where 10,000 $\theta$'s were drawn from a standard
normal distribution.

A pool of 400 items fitting the 2PLM with $a_i \sim N(1; 0.2)$ and $b_i \sim U(-3; 3)$ was
used. An adaptive response pattern was simulated as follows. First, the true $\theta$ of a simulee
was drawn from a standard normal distribution or was set to a fixed $\theta$-level, depending on the
dataset constructed. Then, the first item of the CAT selected was the item with maximum information
given $\theta = 0$. For this item, $P_i(\theta)$ according to Equation 1 was determined. To simulate
the answer (1 or 0), a random number $y$ from the uniform distribution on the interval [0, 1] was
drawn; when $y < P_i(\theta)$ the response to item $i$ was set to 1 (correct response), and to 0
otherwise. The first four items of the CAT were selected with maximum information for $\theta = 0$,
and based on the responses to these four items, $\hat{\theta}$ was obtained. The next item selected
was the item with maximum information given $\hat{\theta}$. For this item, $P_i(\theta)$ was
computed, a response was simulated, $\theta$ was re-estimated, and another item was selected based on
maximum information given $\hat{\theta}$ at that stage. This procedure was repeated until the
asymptotic standard error of $\hat{\theta}$ reached 0.25; this is an often used value, see for
example DeAyala (1992) and Nering (1997). The asymptotic standard error of $\hat{\theta}$ was
determined by



$$SE(\hat{\theta}) = \left[ \sum_{i} a_i^2 P_i(\theta)\left(1 - P_i(\theta)\right) \right]^{-1/2}, \tag{12}$$



where the sum was across all administered items and $P_i(\theta)$ was defined by the 2PLM given in
Equation 1; the standard error was estimated by substituting $\hat{\theta}$ for $\theta$.
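
To give a feeling for the stopping rule, the following sketch evaluates Equation 12 for a hypothetical set of items. With $a_i = 1$ and $b_i$ equal to $\theta$, each item contributes $a_i^2 P_i (1 - P_i) = 0.25$ to the sum, so 64 such items are needed to reach a standard error of 0.25. The function names and the item set are illustrative assumptions.

```python
import math

def item_information(theta, a, b):
    """Contribution of one 2PLM item to the sum in Equation 12."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def standard_error(theta, items):
    """Asymptotic standard error of the trait estimate (Equation 12)."""
    return 1.0 / math.sqrt(sum(item_information(theta, a, b) for a, b in items))

# Hypothetical set of 64 items with a = 1 and b equal to theta = 0.
items = [(1.0, 0.0)] * 64
print(standard_error(0.0, items))  # 0.25
```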

For each response vector the values of $l_z$ and $l_z^*$ were calculated using $\hat{\theta}$, and
these 10,000 values of $l_z$ and $l_z^*$ were used to obtain the distribution of $l_z$ and $l_z^*$
for each dataset. Also, $l_z$ was calculated using true $\theta$ for each response vector, to
constitute the distribution of $l_z$ without the presence of estimation errors.
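
The adaptive procedure described above can be sketched in Python as follows. This is a simplified illustration and not the study's implementation: the trait estimate comes from a grid-search maximum likelihood estimate on $[-4, 4]$ rather than Warm's weighted maximum likelihood estimator, the cap of 100 items is a hypothetical safeguard, the value 0.2 is treated as a standard deviation, and all function names are assumptions.

```python
import math
import random

def p_2plm(theta, a, b):
    """Probability of a keyed response under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Item information of a 2PLM item at theta."""
    p = p_2plm(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [g / 20.0 for g in range(-80, 81)]  # theta grid on [-4, 4]

def estimate_theta(administered, responses):
    """Grid-search ML estimate; a stand-in for the weighted ML estimator."""
    def loglik(theta):
        total = 0.0
        for (a, b), x in zip(administered, responses):
            p = p_2plm(theta, a, b)
            total += math.log(p) if x else math.log(1.0 - p)
        return total
    return max(GRID, key=loglik)

def simulate_cat(true_theta, pool, rng, target_se=0.25, n_start=4, max_items=100):
    """Simulate one adaptive response pattern with maximum-information selection."""
    available = list(pool)
    administered, responses = [], []
    theta_hat = 0.0  # the first n_start items are selected at theta = 0
    while len(administered) < max_items:
        item = max(available, key=lambda it: information(theta_hat, *it))
        available.remove(item)
        administered.append(item)
        # The response is generated from the true theta, not the estimate.
        responses.append(1 if rng.random() < p_2plm(true_theta, *item) else 0)
        if len(administered) >= n_start:
            theta_hat = estimate_theta(administered, responses)
            info = sum(information(theta_hat, a, b) for a, b in administered)
            if 1.0 / math.sqrt(info) <= target_se:
                break
    return theta_hat, administered, responses

rng = random.Random(7)
pool = [(max(0.1, rng.gauss(1.0, 0.2)), rng.uniform(-3.0, 3.0)) for _ in range(400)]
theta_hat, administered, responses = simulate_cat(0.5, pool, rng)
print(len(administered), theta_hat)
```

Because the administered items depend on the simulated responses, repeated calls at the same true $\theta$ generally produce different test designs, which is the stochastic element that a fixed test design ignores.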



Results 
Using $\hat{\theta}$

In Table 1 the first four moments and the critical values at level $\alpha$ of the simulated
distribution of $l_z$, using $\hat{\theta}$, are given for different $\theta$-levels for a
conventional test of 20, 50, and 80 items, and for a CAT. In Table 2 the first four moments and the
critical values at level $\alpha$ of the simulated distributions of $l_z^*$ are given for different
$\theta$-levels for a conventional test of 20, 50, and 80 items, and for a CAT.

P&P. Table 1 shows that, for the conventional test of 20 items, the mean and variance
of the sampling distribution of $l_z$ were different from 0 and 1, the values expected under the
standard normal distribution. For longer tests (50 and 80 items), the first two moments of the
distribution of $l_z$ were closer to 0 and 1. However, the distribution tended to be negatively
skewed at all test lengths; for example, for the test with 80 items, the strongest skewness
observed was $-0.36$, for $\theta = -2.0$ and $-1.5$. The kurtosis tended to be slightly positive
for all test lengths; for example, for the test of length 50, the kurtosis varied between 0.10 and
0.45 for $\theta = -0.5$ and $-2.0$,
Table 1. Distributional characteristics of the simulated distribution of $l_z$, using $\hat{\theta}$.

                                                             critical value at level $\alpha$
Test           $\theta$    mean  variance  skewness  kurtosis    0.01   0.02   0.03   0.04   0.05
P&P 20 items     N(0,1)    0.16      0.77     -0.67      0.55   -2.25  -1.92  -1.73  -1.59  -1.46
                   -2.0    0.16      0.56     -0.80      0.93   -1.98  -1.68  -1.47  -1.34  -1.24
                   -1.5    0.12      0.75     -0.75      0.92   -2.34  -1.97  -1.74  -1.59  -1.46
                   -1.0    0.12      0.88     -0.66      0.46   -2.55  -2.20  -1.94  -1.74  -1.59
                   -0.5    0.12      0.94     -0.61      0.40   -2.53  -2.15  -1.94  -1.76  -1.61
                    0.0    0.12      0.89     -0.69      0.58   -2.52  -2.16  -1.92  -1.77  -1.62
                    0.5    0.17      0.78     -0.62      0.40   -2.27  -1.94  -1.72  -1.56  -1.43
                    1.0    0.18      0.66     -0.68      0.52   -2.11  -1.77  -1.57  -1.41  -1.29
                    1.5    0.20      0.58     -0.73      0.41   -1.90  -1.64  -1.46  -1.33  -1.22
                    2.0    0.24      0.48     -0.85      0.75   -1.75  -1.45  -1.28  -1.16  -1.08
P&P 50 items     N(0,1)    0.09      0.86     -0.41      0.20   -2.32  -2.01  -1.80  -1.66  -1.54
                   -2.0    0.09      0.54     -0.52      0.45   -1.86  -1.60  -1.43  -1.31  -1.23
                   -1.5    0.09      0.69     -0.43      0.27   -2.07  -1.77  -1.61  -1.49  -1.39
                   -1.0    0.08      0.87     -0.44      0.32   -2.43  -2.05  -1.82  -1.66  -1.54
                   -0.5    0.09      0.97     -0.41      0.10   -2.51  -2.17  -1.92  -1.76  -1.64
                    0.0    0.07      0.99     -0.42      0.22   -2.51  -2.20  -1.95  -1.80  -1.67
                    0.5    0.09      0.94     -0.38      0.15   -2.42  -2.10  -1.87  -1.72  -1.59
                    1.0    0.09      0.86     -0.37      0.13   -2.28  -2.01  -1.79  -1.65  -1.54
                    1.5    0.10      0.76     -0.43      0.28   -2.21  -1.88  -1.68  -1.54  -1.42
                    2.0    0.09      0.64     -0.46      0.33   -2.03  -1.74  -1.55  -1.43  -1.33
P&P 80 items     N(0,1)    0.06      0.89     -0.32      0.16   -2.37  -2.04  -1.84  -1.68  -1.58
                   -2.0    0.08      0.55     -0.36      0.17   -1.87  -1.59  -1.43  -1.32  -1.21
                   -1.5    0.07      0.71     -0.36      0.19   -2.10  -1.83  -1.67  -1.54  -1.42
                   -1.0    0.07      0.85     -0.35      0.04   -2.28  -1.97  -1.78  -1.63  -1.53
                   -0.5    0.06      0.97     -0.30      0.10   -2.45  -2.13  -1.91  -1.74  -1.63
                    0.0    0.07      1.01     -0.34      0.16   -2.49  -2.16  -1.92  -1.79  -1.68
                    0.5    0.08      0.96     -0.33      0.12   -2.47  -2.10  -1.90  -1.73  -1.63
                    1.0    0.08      0.90     -0.31     -0.01   -2.28  -2.01  -1.82  -1.67  -1.55
                    1.5    0.08      0.80     -0.33      0.14   -2.25  -1.95  -1.75  -1.58  -1.48
                    2.0    0.07      0.91     -0.34      0.27   -2.45  -2.05  -1.83  -1.67  -1.56
CAT              N(0,1)    0.39      0.79     -0.20      0.03   -1.78  -1.51  -1.32  -1.22  -1.13
                   -2.0    0.27      0.84     -0.36      0.06   -2.08  -1.77  -1.58  -1.44  -1.32
                   -1.5    0.36      0.84     -0.25      0.07   -1.93  -1.65  -1.48  -1.32  -1.21
                   -1.0    0.42      0.76     -0.17     -0.10   -1.70  -1.45  -1.27  -1.16  -1.07
                   -0.5    0.40      0.75     -0.16     -0.19   -1.66  -1.44  -1.27  -1.17  -1.08
                    0.0    0.40      0.77     -0.19     -0.11   -1.72  -1.48  -1.33  -1.21  -1.11
                    0.5    0.41      0.81     -0.18     -0.05   -1.72  -1.50  -1.34  -1.23  -1.13
                    1.0    0.38      0.83     -0.22     -0.10   -1.83  -1.60  -1.42  -1.29  -1.20
                    1.5    0.34      0.82     -0.30     -0.01   -1.97  -1.64  -1.45  -1.33  -1.22
                    2.0    0.32      0.78     -0.39      0.19   -1.99  -1.69  -1.48  -1.33  -1.21

Note. In the rows labeled N(0,1), $\theta$ was drawn from a standard normal distribution; in all other rows $\theta$ was fixed at the value shown.
Figure 1. Critical values, at $\alpha = 0.05$, of the simulated distribution of $l_z$, using $\hat{\theta}$.
respectively. A positive kurtosis indicates a leptokurtic distribution, that is, a distribution with
heavier tails and a higher peak than the standard normal distribution. Table 1 also shows that the
observed critical values differed across $\theta$-levels. For example, for a test of 20 items,
the observed critical values at $\alpha = 0.01$ varied between $-2.55$ and $-1.75$ for
$\theta = -1.0$ and $2.0$, respectively, whereas the critical value at $\alpha = 0.01$ under the
standard normal distribution is $-2.33$. In Figure 1 the critical values of the distribution of
$l_z$, using $\hat{\theta}$, at $\alpha = 0.05$, are plotted against $\theta$ for the conventional
tests of 20, 50, and 80 items and a CAT. These critical values are compared with the critical value
expected under the standard normal distribution, that is, $-1.65$. The distribution of $l_z$ for a
CAT will be discussed below. Figure 1 shows that for longer tests and for $-1 < \theta < 1$ the
critical values observed in the simulated distribution were close to $-1.65$. So, especially for
large positive and large negative $\theta$-values, the critical values in the simulated distribution
were different from the expected critical value under the standard normal distribution.

Table 2 shows that the mean and variance of the distribution of $l_z^*$ were close to 0 and
1, respectively, for all $\theta$-values and conventional tests of 20, 50, and 80 items. For example,
for a test of 50 items the mean of $l_z^*$ was 0.09 at $\theta = 0.5$ and $\theta = -0.5$, and 0.05
at $\theta = -2.0$, and the variance was approximately 1 for all $\theta$-values. However, for all
tests and across all $\theta$-values the distribution tended to be negatively skewed; for a test of
20 items and $\theta = 0$ the skewness was $-0.68$. Also, the kurtosis tended to be positive; for
example, for a test of 80 items and $\theta = 0$ the kurtosis was 0.16. Table 2 also shows that the
critical values in the simulated distribution were approximately equal across $\theta$-levels.
However, for smaller values of $\alpha$ the critical values in the simulated distribution differed
from the critical values of the standard normal distribution. Moreover, for $\alpha < 0.04$ the use
of $l_z^*$ resulted in a slightly conservative classification of nonfitting response patterns; for
example, for the standard normal distribution the critical value at $\alpha = 0.01$ is $-2.33$, and
for the test of 50 items the critical values in the simulated distribution of $l_z^*$ varied from
$-2.60$ to $-2.43$ at $\theta = -1.0$ and $\theta = -1.5$, respectively. Figure 2 shows the critical
values at $\alpha = 0.05$ of the simulated distribution of $l_z^*$ across $\theta$-levels for the
three conventional tests and for a CAT. The results of the CAT will be discussed below. Table 2 and
Figure 2 both show that the critical values at $\alpha = 0.05$ in the simulated distribution were
close to $-1.65$, as expected for the standard normal distribution; for example, for the
conventional test of 20 items the critical values at $\alpha = 0.05$ in the simulated distribution
varied from $-1.75$ to $-1.65$ for $\theta = -2.0$ and $\theta = 1.0$, respectively.

CAT Environment. Table 1 shows that, for a CAT, the first two moments of the
distribution of $l_z$ were substantially different from 0 and 1 for all $\theta$-levels; mean and variance
Table 2. Distributional characteristics of the simulated distribution of $l_z^*$.

                                                             critical value at level $\alpha$
Test           $\theta$    mean  variance  skewness  kurtosis    0.01   0.02   0.03   0.04   0.05
P&P 20 items     N(0,1)    0.13      0.97     -0.65      0.40   -2.53  -2.18  -1.97  -1.83  -1.67
                   -2.0    0.09      0.98     -0.73      0.49   -2.68  -2.30  -2.01  -1.86  -1.75
                   -1.5    0.10      1.01     -0.70      0.55   -2.73  -2.28  -2.05  -1.88  -1.74
                   -1.0    0.11      1.00     -0.66      0.40   -2.72  -2.32  -2.07  -1.88  -1.73
                   -0.5    0.11      1.01     -0.61      0.36   -2.64  -2.26  -2.02  -1.83  -1.69
                    0.0    0.11      1.01     -0.68      0.51   -2.71  -2.30  -2.07  -1.88  -1.75
                    0.5    0.15      0.98     -0.61      0.27   -2.54  -2.20  -2.00  -1.80  -1.66
                    1.0    0.14      0.97     -0.67      0.37   -2.58  -2.25  -2.00  -1.82  -1.65
                    1.5    0.14      0.97     -0.71      0.25   -2.60  -2.19  -2.00  -1.84  -1.70
                    2.0    0.16      0.94     -0.77      0.39   -2.47  -2.16  -2.00  -1.82  -1.67
P&P 50 items     N(0,1)    0.08      0.99     -0.41      0.13   -2.50  -2.17  -1.94  -1.79  -1.67
                   -2.0    0.05      1.00     -0.50      0.25   -2.60  -2.27  -2.04  -1.89  -1.74
                   -1.5    0.07      0.98     -0.42      0.12   -2.43  -2.17  -1.96  -1.81  -1.68
                   -1.0    0.07      1.01     -0.43      0.27   -2.60  -2.21  -2.00  -1.82  -1.68
                   -0.5    0.09      1.01     -0.40      0.09   -2.59  -2.21  -1.97  -1.79  -1.67
                    0.0    0.07      1.02     -0.42      0.21   -2.55  -2.23  -1.99  -1.82  -1.70
                    0.5    0.09      1.01     -0.38      0.14   -2.52  -2.19  -1.95  -1.80  -1.65
                    1.0    0.07      1.01     -0.38      0.12   -2.52  -2.18  -1.98  -1.82  -1.70
                    1.5    0.08      1.01     -0.42      0.19   -2.56  -2.20  -1.97  -1.81  -1.67
                    2.0    0.06      1.01     -0.46      0.23   -2.57  -2.22  -2.02  -1.86  -1.72
P&P 80 items     N(0,1)    0.05      1.00     -0.32      0.10   -2.52  -2.18  -1.97  -1.79  -1.68
                   -2.0    0.05      0.99     -0.36      0.04   -2.54  -2.21  -2.00  -1.83  -1.70
                   -1.5    0.05      0.99     -0.37      0.14   -2.55  -2.19  -1.98  -1.84  -1.69
                   -1.0    0.07      0.98     -0.35      0.02   -2.43  -2.13  -1.93  -1.76  -1.65




-0.5 


0.06 


1.01 


-0.30 


0.10 


-2.49 


-2.16 


-1.94 


-1.78 


,-1.66 




0.0 


0.06 


1.02 


-0.34 


0.16 


-2.50 


-2.17 


-1.93 


-1.80 


-1.69 




0.5 


0.07 


1.00 


-0.33 


0.13 


-2.52 


-2.15 


-1.93 


-1.76 


-1.65 




1.0 


0.08 


1.00 


-0.31 


-0.02 


-2.42 


-2.14 


-1.94 


-1.78 


-1.66 




1.5 


0.07 


1.00 


-0.33 


0.10 


-2.50 


-2.18 


-1.95 


-1.79 


-1.68 




2.0 


0.06 


1.01 


-0.33 


0.17 


-2.57 


-2.19 


-1.95 


-1.79 


-1.67 


CAT 




N(0,1) 


0.39 


0.83 


-0.23 


0.10 


-1.87 


-1.56 


-1.37 


-1.27 


-1.17 


0 = 


-2.0 


0.26 


0.87 


-0.39 


0.14 


-2.14 


-1.84 


-1.62 


-1.48 


-1.35 




-1.5 


0.36 


0.86 


-0.28 


0.14 


-1.98 


-1.68 


-1.50 


-1.35 


-1.23 




-1.0 


0.42 


0.80 


-0.19 


-0.05 


-1.77 


-1.50 


-1.32 


-1.19 


-1.10 




-0.5 


0.41 


0.80 


-0.17 


-0.17 


-1.72 


-1.49 


-1.34 


-1.21 


-1.12 




0.0 


0.40 


0.79 


-0.21 


-0.06 


-1.76 


-1.54 


-1.37 


-1.23 


-1.13 




0.5 


0.41 


0.82 


-0.20 


-0.00 


-1.77 


-1.53 


-1.37 


-1.25 


-1.15 




1.0 


0.38 


0.85 


-0.24 


-0.04 


-1.88 


-1.64 


-1.45 


-1.32 


-1.22 




1.5 


0.34 


0.86 


-0.33 


0.06 


-2.05 


-1.68 


-1.50 


-1.36 


-1.26 




2.0 


0.33 


0.86 


-0.43 


0.29 


-2.14 


-1.82 


-1.57 


-1.42 


-1.29 
Figure 2. Critical values, at α = 0.05, of the simulated distribution of l*_z.

fluctuated around 0.40 and 0.80, respectively. Skewness and kurtosis were also different from
0. The distribution was found to be negatively skewed, and the highest skewness observed was
-0.39 for θ = 2.0. The highest kurtosis was found for θ = -0.5 and 2.0 where the kurtosis
was -0.19 and 0.19, respectively. Thus, the distribution of l_z using θ̂ was quite different from
the standard normal distribution. Table 1 and Figure 1 both show that the critical values in
the sampling distribution tended to be closer to 0 than expected under the standard normal
distribution for all θ and α. For example, at θ = 0 the critical value at α = 0.05 was -1.11;
that is, 5% of the simulees obtained an l_z-value below -1.11. Thus, using l_z < -1.65 will result
in too few simulees being classified as aberrant; that is, the decision rule will result in a
conservative classification of aberrant response behavior. Table 1 also shows that the critical
values were different across θ-levels. For example, for θ = 0 the critical value at α = 0.01 was
-1.72 whereas the critical value for θ = 2 was -1.99. When θ was drawn from the standard normal
distribution, the critical values were also closer to 0 than expected; for example, the critical
value at α = 0.05 was -1.13.

Table 2 shows that for a CAT the mean and variance of l*_z were quite different from 0
and 1, respectively; for example, at θ = 0 the mean and variance are 0.40 and 0.79, respectively.
It also shows that the simulated distribution tended to be negatively skewed; the skewness
varied from -0.17 to -0.43 at θ = -0.5 and θ = -2.0, respectively. The kurtosis was less
systematically distributed; for -1.0 < θ < 1.0 the kurtosis was slightly negative, for other
θ-values positive kurtosis occurred. Figure 2 and Table 2 both show that the critical values in
the simulated distribution of l*_z were not in agreement with critical values of the standard normal
distribution. For example, for α = 0.05 the critical values in the simulated distribution varied
from -1.10 to -1.35 at θ = -1.0 and θ = -2.0, respectively.

Using θ

In Table 3 the first four moments and the critical values at level α of the simulated
distributions of l_z, when true θ was used, are given.

P&P. Table 3 shows that, for all conventional tests, the first two moments of the
distribution of l_z were close to 0 and 1, as expected under the standard normal distribution.
However, the distributions are still negatively skewed and have positive kurtosis; for longer
tests (50-80 items) the observed skewness fluctuated around -0.35 and the kurtosis around
0.11. Table 3 also shows that the critical values were about the same across θ-levels. However,
the critical values tended to be slightly smaller than expected; for example the critical value
at α = 0.03 under the standard normal distribution is -1.88, and the values observed in the



Simulating the Null Distribution - 16 



Table 3. Distributional characteristics of the simulated distribution of l_z, using true θ.

                                                        critical value at α
θ                mean  variance  skewness  kurtosis    0.01    0.02    0.03    0.04    0.05

P&P 20 items
θ ~ N(0,1)       0.02      0.98     -0.66      0.43   -2.73   -2.32   -2.09   -1.93   -1.78
θ = -2.0        -0.01      0.98     -0.77      0.74   -2.78   -2.39   -2.16   -2.00   -1.85
θ = -1.5        -0.00      1.03     -0.77      0.75   -2.87   -2.49   -2.12   -2.01   -1.86
θ = -1.0         0.01      1.01     -0.67      0.41   -2.80   -2.42   -2.22   -1.99   -1.83
θ = -0.5         0.01      1.01     -0.56      0.32   -2.75   -2.34   -2.11   -1.92   -1.79
θ =  0.0        -0.01      1.00     -0.63      0.50   -2.76   -2.41   -2.18   -1.99   -1.85
θ =  0.5         0.02      0.98     -0.63      0.36   -2.66   -2.34   -2.08   -1.92   -1.80
θ =  1.0         0.01      0.99     -0.70      0.38   -2.75   -2.36   -2.14   -1.95   -1.82
θ =  1.5        -0.00      0.99     -0.75      0.42   -2.73   -2.42   -2.17   -1.99   -1.85
θ =  2.0         0.01      0.97     -0.96      1.20   -2.90   -2.47   -2.20   -1.96   -1.81

P&P 50 items
θ ~ N(0,1)       0.00      0.99     -0.42      0.10   -2.54   -2.25   -2.05   -1.89   -1.76
θ = -2.0        -0.02      1.02     -0.58      0.38   -2.76   -2.36   -2.13   -1.97   -1.85
θ = -1.5         0.00      1.01     -0.51      0.24   -2.67   -2.32   -2.11   -1.92   -1.79
θ = -1.0        -0.00      1.01     -0.47      0.30   -2.71   -2.34   -2.09   -1.90   -1.77
θ = -0.5         0.01      1.02     -0.42      0.12   -2.63   -2.29   -2.10   -1.90   -1.77
θ =  0.0        -0.01      1.02     -0.40      0.19   -2.64   -2.32   -2.06   -1.90   -1.77
θ =  0.5        -0.00      1.02     -0.38      0.12   -2.63   -2.27   -2.03   -1.90   -1.77
θ =  1.0        -0.01      0.99     -0.39      0.07   -2.53   -2.17   -2.01   -1.90   -1.78
θ =  1.5        -0.01      1.02     -0.53      0.34   -2.73   -2.35   -2.13   -1.94   -1.81
θ =  2.0        -0.02      1.00     -0.61      0.42   -2.82   -2.39   -2.20   -2.01   -1.84

P&P 80 items
θ ~ N(0,1)      -0.01      1.01     -0.34      0.12   -2.60   -2.24   -2.04   -1.88   -1.75
θ = -2.0        -0.00      0.97     -0.48      0.30   -2.64   -2.27   -2.06   -1.90   -1.75
θ = -1.5        -0.00      0.98     -0.43      0.11   -2.66   -2.27   -2.03   -1.87   -1.76
θ = -1.0        -0.00      0.99     -0.39      0.13   -2.58   -2.23   -2.02   -1.86   -1.76
θ = -0.5        -0.00      1.01     -0.32      0.11   -2.57   -2.23   -2.03   -1.87   -1.71
θ =  0.0        -0.00      1.02     -0.33      0.16   -2.58   -2.23   -2.00   -1.88   -1.76
θ =  0.5         0.00      1.00     -0.33      0.13   -2.58   -2.22   -2.00   -1.85   -1.72
θ =  1.0         0.01      1.00     -0.32      0.00   -2.51   -2.24   -2.00   -1.85   -1.72
θ =  1.5        -0.00      1.00     -0.41      0.21   -2.62   -2.29   -2.05   -1.89   -1.74
θ =  2.0        -0.01      1.01     -0.36      0.23   -2.68   -2.29   -2.04   -1.88   -1.75

CAT
θ ~ N(0,1)       0.04      0.95     -0.23      0.06   -2.38   -2.10   -1.87   -1.72   -1.62
θ = -2.0         0.01      0.95     -0.36      0.05   -2.51   -2.16   -2.00   -1.83   -1.69
θ = -1.5         0.01      0.99     -0.24      0.01   -2.47   -2.16   -1.95   -1.81   -1.70
θ = -1.0         0.05      0.95     -0.24      0.05   -2.39   -2.09   -1.86   -1.74   -1.62
θ = -0.5         0.05      0.94     -0.22     -0.03   -2.36   -2.06   -1.87   -1.71   -1.59
θ =  0.0         0.06      0.92     -0.28      0.10   -2.35   -2.03   -1.83   -1.67   -1.56
θ =  0.5         0.05      0.95     -0.20      0.01   -2.30   -2.03   -1.84   -1.72   -1.61
θ =  1.0         0.02      0.97     -0.28      0.04   -2.43   -2.13   -1.93   -1.81   -1.69
θ =  1.5         0.04      0.93     -0.32      0.02   -2.36   -2.10   -1.88   -1.73   -1.62
θ =  2.0         0.02      0.94     -0.40      0.16   -2.53   -2.21   -2.00   -1.84   -1.70
Figure 3. Critical values, at α = 0.05, of the simulated distribution of l_z, using θ.

simulated distributions for k = 80 are close to -2.00. In Figure 3 the critical values at α = 0.05,
of the conventional tests and a CAT, are plotted against θ and compared with the critical value
expected under the standard normal distribution, that is -1.65. Figure 3 shows that the critical
values are approximately the same across θ-levels and that the critical values using simulated
data have larger negative values than expected under the standard normal distribution.
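
How such simulated critical values arise can be illustrated with a minimal Python sketch. It assumes the usual definition of l_z as the standardized log-likelihood of a response pattern, a 2PLM in the logistic metric without the 1.7 scaling constant, and hypothetical item parameters; the function names, the number of replications, and the item parameters are illustrative and are not taken from this report.

```python
import numpy as np


def p_2pl(theta, a, b):
    """2PLM probability of a correct response (logistic metric, no 1.7 constant)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def lz(x, p):
    """Standardized log-likelihood l_z; items run along the last axis of x and p."""
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p), axis=-1)
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2, axis=-1)
    return (l0 - expected) / np.sqrt(variance)


def simulated_critical_value(theta, a, b, alpha=0.05, n_rep=10_000, seed=0):
    """Empirical alpha-quantile of l_z for model-fitting patterns at a known theta."""
    rng = np.random.default_rng(seed)
    p = p_2pl(theta, a, b)
    # Each row is one replicated response pattern on the same fixed test.
    x = (rng.random((n_rep, p.size)) < p).astype(float)
    return float(np.quantile(lz(x, p), alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(1.0, 0.2, size=80)    # hypothetical discriminations
    b = rng.uniform(-3.0, 3.0, size=80)  # hypothetical difficulties
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(simulated_critical_value(theta, a, b), 2))
```

The printed quantiles can be compared with the standard normal value of -1.65; because the simulated distribution is skewed to the left, they need not coincide with it.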

CAT Environment. Using θ to determine the distribution of l_z resulted in a mean and
variance close to 0 and 1, as expected under the standard normal distribution. However, the
distribution tended to be negatively skewed, with the largest value of -0.40 for θ = 2.0. For
θ = 2.0 the highest kurtosis of 0.16 was obtained. It can be concluded that the distribution of
l_z using true θ was more in agreement with the standard normal distribution than when θ̂ was
used. Similar conclusions pertain for the critical values. Table 3 and Figure 3 both show that
the critical values were close to -1.65 as expected under the standard normal distribution.

Study 2 

In Study 1 it was shown that the critical values of the distribution of l*_z were close to
the critical values of the standard normal distribution for conventional tests. It was also shown
that for long conventional tests (50-80 items) and -1 < θ < 1 the critical values of l_z were
reasonably in agreement with the standard normal distribution. However, for extreme positive
and negative θ-levels, the critical values found in the simulated distribution were quite different
than expected under the standard normal distribution. An alternative to using critical values
from the theoretical distribution is to simulate a distribution for a person-fit statistic for each
simulee. In this second study, the distributions of l_0 and l_z were simulated for every simulee, and
the influence of estimation errors in θ̂ on the distributions of l_0 and l_z was investigated in a
conventional testing and CAT environment.

With respect to CAT, it was shown in Study 1 that (1) the distributions of l_z and l*_z did
not follow a standard normal distribution and (2) the distributions of l_z and l*_z differed across
θ-levels when θ̂ was used; as a result, it is advisable to simulate the distribution conditional on
θ or θ̂. Another aspect of this second study was to investigate the influence of the stochastic
nature of the test design in CAT.

Method 

P&P. Eight datasets of 400 response vectors fitting the 2PLM were
constructed with a_i ~ N(1, 0.2) and b_i ~ U(-3, 3); each dataset contained a test of different
test length, and each test was fixed for all simulees. Test length was k = 10, 20, 30, 40,
50, 60, 70, and 80 items. True θ was drawn from the standard normal distribution, where
each θ-value represented a simulee responding to a test; θ was estimated by θ̂ using weighted
maximum likelihood estimation (Warm, 1989). For each simulee, the distributions of l_0 and l_z
were simulated in two different ways, both using parametric bootstrap techniques (Efron, 1982).
First, for each simulee it was assumed that θ equalled θ̂. For example, suppose a simulee with
true parameter value θ = 1.5 responded to a test and θ̂ = 1.2; then, for this simulee, 1,000
replications were generated with θ = 1.2. For each replicated response pattern the values of l_0
and l_z were determined to obtain the simulated distribution; θ̂ was used to compute the values
of l_0 and l_z. Then, the values of l_0 and l_z of the original response patterns, also computed using
θ̂, were compared with the simulated distribution by determining the significance probability
under the simulated distribution. Second, it was assumed that the true parameter value was θ;
for example, for a simulee with true parameter θ = 1.5 and θ̂ = 1.2, it was assumed that
the true parameter value was known and was 1.5. For each simulee, 1,000 replications were
generated where the known true parameter value was set to θ. Doing this, estimation errors in
θ̂ are excluded from approximating the distribution of l_0 and l_z. For each replicated response
pattern the values of l_0 and l_z were determined using true θ to obtain the simulated distribution.
Then, the values of l_0 and l_z of the original response patterns, also computed using true θ, were
compared with the simulated distribution by determining the significance probability under
the sampling distribution. Also, for each dataset the mean absolute bias was determined as



$$\mathrm{MAB} = \frac{1}{N} \sum_{j=1}^{N} \left| \hat{\theta}_j - \theta_j \right|,$$

where the sum is across all N simulees.

Note that for conventional tests the distributions of l_0 and l_z are equivalent; the
distribution is simulated conditional on θ or θ̂, all items are the same, and therefore, for every
replication E(l_0) and var(l_0) are the same.
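
The per-simulee bootstrap for a conventional test can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this report: l_z is taken to be the standardized log-likelihood, the significance probability is taken to be the proportion of replicated l_z values at or below the observed value, and the ability value passed in (θ or θ̂) is held fixed across replications. All function names are hypothetical.

```python
import numpy as np


def p_2pl(theta, a, b):
    """2PLM probability of a correct response (logistic metric, no 1.7 constant)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def lz(x, p):
    """Standardized log-likelihood l_z; items run along the last axis."""
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p), axis=-1)
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2, axis=-1)
    return (l0 - expected) / np.sqrt(variance)


def bootstrap_p_value(x_obs, theta_used, a, b, n_rep=1000, rng=None):
    """Left-tail significance probability of the observed l_z.

    theta_used is either the true theta or its estimate; the replications
    are generated and scored with that same value (an assumption here).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = p_2pl(theta_used, a, b)
    x_rep = (rng.random((n_rep, p.size)) < p).astype(float)
    return float(np.mean(lz(x_rep, p) <= lz(x_obs, p)))


def mean_absolute_bias(theta_hat, theta):
    """MAB: average absolute difference between estimated and true abilities."""
    return float(np.mean(np.abs(np.asarray(theta_hat) - np.asarray(theta))))
```

A small significance probability indicates that the observed pattern is unusual relative to patterns generated under the model at the ability value used.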

CAT. A dataset of 400 model-fitting adaptive response patterns was constructed using
the item pool and procedure described in Study 1, where true θ was drawn from the standard
normal distribution. The distributions of l_0, l_z, and l*_z were simulated in two different ways, both
using parametric bootstrap methods. First, for each adaptive response vector these distributions
were simulated using a fixed test design (cf. Equation 10). For each simulee, 500 response
patterns were replicated where the test design was set to the observed test design. Thus, for
each simulee, the administered test was viewed as a conventional test, and this conventional
test was replicated 500 times, conditional on the value of θ or θ̂; the values of l_0 and l_z were
computed for each replicated response pattern to obtain the distribution given θ or θ̂ and d,
whereas the values of l*_z were only determined using θ̂. Then, the significance probability was
determined by comparing the values of l_0, l_z, and l*_z of the original response pattern with the
simulated distribution.

Second, the distributions of l_0, l_z, and l*_z were simulated using the stochastic test design
(cf. Equation 11). For each simulee, an adaptive test was replicated 500 times, where true θ or
θ̂ was used; the values of l_0 and l_z were computed using θ or θ̂, whereas the values of l*_z were
only computed using θ̂. For each simulee, 500 adaptive response patterns given θ or θ̂ were
replicated according to the CAT-procedure described in Study 1; that is, P(θ) or P(θ̂) was
used to generate responses to items. This procedure was repeated until SE < 0.25. Thus,
for each simulee 500 adaptive response patterns were simulated conditional on the value of θ
or θ̂. For each replicated response pattern the values of l_0, l_z, and l*_z were determined to obtain
the simulated distribution. Then, the values of l_0, l_z, and l*_z of the original response patterns
were compared with the simulated distribution by determining the significance probabilities.
When the replications were generated using θ̂, all l_0, l_z, and l*_z values were determined by using
θ̂ in Equations 2 and 3. When the replications were generated using θ, all l_0 and l_z values
were computed using θ in order to determine a distribution without the presence of estimation
errors. Although in practice θ is unknown, determining the distribution based on θ allows us to
investigate the influence of θ̂ on the distribution of l_0 and l_z.
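
The difference between the two replication schemes can be made concrete with a small sketch. The CAT procedure of Study 1 is not reproduced in this section, so the sketch below substitutes simple, clearly hypothetical choices: maximum-information item selection, an EAP ability estimate on a grid with a standard normal prior (instead of weighted maximum likelihood), a stopping rule of posterior SD < 0.25, and a cap on test length. Only the contrast between re-using the administered items and re-running the whole adaptive test is the point.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 81)  # grid for the EAP estimate (assumption)


def p_2pl(theta, a, b):
    """2PLM probability of a correct response (logistic metric)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def run_cat(theta_gen, a, b, rng, se_stop=0.25, max_items=60):
    """Administer one adaptive test; responses are generated with theta_gen."""
    log_post = -0.5 * GRID ** 2                # standard normal prior (assumption)
    used = np.zeros(a.size, dtype=bool)
    items, responses = [], []
    theta_hat, se = 0.0, np.inf
    while se >= se_stop and len(items) < min(max_items, a.size):
        p_hat = p_2pl(theta_hat, a, b)
        info = a ** 2 * p_hat * (1 - p_hat)    # item information at theta_hat
        info[used] = -np.inf                   # never reuse an administered item
        i = int(np.argmax(info))
        x = int(rng.random() < p_2pl(theta_gen, a[i], b[i]))
        p_grid = p_2pl(GRID, a[i], b[i])
        log_post = log_post + np.log(p_grid if x else 1 - p_grid)
        w = np.exp(log_post - log_post.max())
        w /= w.sum()
        theta_hat = float(w @ GRID)
        se = float(np.sqrt(w @ (GRID - theta_hat) ** 2))
        used[i] = True
        items.append(i)
        responses.append(x)
    return np.array(items), np.array(responses), theta_hat


def replicate_fixed(theta_gen, a, b, items, n_rep, rng):
    """Fixed design: new responses to the items that were actually administered."""
    p = p_2pl(theta_gen, a[items], b[items])
    return (rng.random((n_rep, items.size)) < p).astype(int)


def replicate_stochastic(theta_gen, a, b, n_rep, rng):
    """Stochastic design: the whole adaptive test is re-run for each replication."""
    return [run_cat(theta_gen, a, b, rng) for _ in range(n_rep)]
```

Under the fixed design every replication shares the same items, whereas under the stochastic design both the selected items and the test length may vary from one replication to the next.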

Note that for the fixed design the distributions of l_0, l_z, and l*_z are equivalent; the
distribution is simulated conditional on θ or θ̂, all items are the same due to the fixed test design,
and therefore, for every replication E(l_0) and var(l_0) are the same.



P&P. In Table 4 the distributions of the significance probabilities of l_z are given, using θ
or θ̂ to determine the values of l_z. To illustrate the distribution of the significance probabilities,
ten intervals are considered, each of length 0.10. The expected proportion of simulees with
a significance probability in a particular interval was 0.10; for example, it was expected that
10% of the simulees have p*-values with 0.4 < p* < 0.5. To test whether the distribution of
significance probabilities approached the uniform distribution, Pearson's chi-squared statistic, X²,
can be calculated, with E(X²) = 9. Table 4 shows that, when θ was used, the significance
probabilities were uniformly distributed for all test lengths. However, in practice θ is unknown and as
an alternative θ̂ is used. Table 4 shows that, when θ̂ was used, for very short tests, containing
only 10 items, the significance probabilities are not uniformly distributed (X² = 24.4,
p-value = 0.04). For tests containing 20 items or more, the distribution of l_z, using θ̂, is more
in agreement with the uniform distribution. Table 4 also shows that, in general, for small values
of the MAB, the distribution can be approximated using simulation methods based on θ̂. For
example, for the short test of 10 items and using θ̂, MAB = 0.64 and X² = 24.4 whereas for
the test of 80 items MAB = 0.26 and X² = 7.2.
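
The uniformity check described above can be written in a few lines. The following is a minimal sketch that assumes ten equal-width intervals on [0, 1] and an equal expected count in each; with ten intervals the statistic has nine degrees of freedom, which matches E(X²) = 9. The function name is illustrative.

```python
import numpy as np


def uniformity_chi_square(p_values, n_bins=10):
    """Pearson X^2 comparing binned significance probabilities with a uniform."""
    p_values = np.asarray(p_values, dtype=float)
    observed, _ = np.histogram(p_values, bins=n_bins, range=(0.0, 1.0))
    expected = p_values.size / n_bins  # same expected count in every interval
    return float(np.sum((observed - expected) ** 2 / expected))
```

Values of X² far above 9 suggest that the simulated distribution does not reproduce the distribution of the statistic for the original response patterns.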

CAT. In Table 5 the distributions of the significance probabilities of l_0, l_z, and l*_z using
a fixed and a stochastic design are given using θ and θ̂. Table 5 shows that using θ with
a fixed or stochastic test design resulted in an approximately uniform distribution of the
significance probabilities for both l_0 and l_z. However, conditioning on θ̂ and using a fixed or
stochastic test design resulted in an inappropriate approximation of the distribution of l_0, l_z, and
l*_z. For example, using θ̂ and a stochastic test design with l_z as fit index resulted in X² = 65.7,
which is highly significant. Especially the probabilities in the left tail were much too small.
Using a stochastic design, only 4.3% of the simulees attained a p*-value with 0 < p* < 0.1. In
practice, using θ̂ and a stochastic test design to simulate the distribution performs better than
using a fixed test design. That is, the values of X² are lower when the stochastic test design
was used compared with using a fixed design; for l_0 the values of X² for a fixed and stochastic
test design were 81.7 and 34.8, respectively, for l_z 81.7 and 65.7, and for l*_z 81.7 and 65.7,
respectively.



Study 3 

In Study 1 it was shown that the critical value at α = 0.05 of the distribution of l*_z was
close to -1.65. This third study was designed to compare the detection rate of l*_z with l_z for
several types of aberrant response behavior, when it is assumed that the theoretical distribution
is standard normal.

Method 

Several datasets containing 200 nonfitting response patterns were constructed, with 
three types of aberrant response behavior and three different conventional tests; tests containing 
10, 20 and 50 items. The first type was guessing on all the items in a test. This guessing 
model mimics the type of answering behavior studied empirically by Van den Brink (1977). 
He described examinees who took a multiple choice exam without preparation, and the only 
purpose of taking the exam was to become familiar with the type of questions that would be 
asked. Because returning an almost completely blank answering sheet may focus attention on 
an examinee’s ignorance, the examinee would randomly guess the correct answer on almost all 
items in the test. "Guessing" simulees were simulated by randomly guessing the correct answer






on each item with a probability of 0.2 (assuming a test with five alternatives per item). 

Second, response vectors with a two-dimensional θ parameter were simulated: during
the first half of the test a simulee responded with a different ability value than during the
second half. Carelessness, fumbling, or memorization of some items can be the cause of
non-invariant abilities. Two datasets containing response vectors with a two-dimensional ability
parameter were simulated by drawing two ability values, θ_1 and θ_2, from a bivariate standard
normal distribution; the correlation between the two values was modeled by the parameter ρ.
Thus, during the first half of the test P(θ_1) was used and during the second half P(θ_2) was
used to simulate the responses to the items. The values ρ = 0.8 and ρ = 0.6 were used here to
simulate the response patterns.
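
The first two types of aberrant behavior are simple to generate. The sketch below is illustrative only: it assumes a 2PLM in the logistic metric, splits the test at the midpoint of the item order, and uses hypothetical function names; the guessing probability of 0.2 and the correlation ρ follow the description above.

```python
import numpy as np


def p_2pl(theta, a, b):
    """2PLM probability of a correct response (logistic metric)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def simulate_guessing(n_simulees, n_items, p_guess=0.2, rng=None):
    """Every item is answered correctly with probability p_guess."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random((n_simulees, n_items)) < p_guess).astype(int)


def simulate_two_abilities(a, b, rho, n_simulees, rng=None):
    """First half of the test answered with theta_1, second half with theta_2."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.array([[1.0, rho], [rho, 1.0]])
    thetas = rng.multivariate_normal(np.zeros(2), cov, size=n_simulees)
    half = a.size // 2
    p = np.empty((n_simulees, a.size))
    # Column vectors of abilities broadcast against the item parameters.
    p[:, :half] = p_2pl(thetas[:, [0]], a[:half], b[:half])
    p[:, half:] = p_2pl(thetas[:, [1]], a[half:], b[half:])
    x = (rng.random(p.shape) < p).astype(int)
    return x, thetas
```

With ρ close to 1 the two ability values nearly coincide, so such patterns differ only slightly from model-fitting patterns.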

The third type of aberrant response vectors simulated were vectors with violations
of local stochastic independence between the items of the test. When previous items
provide new insights that are useful for answering the next item, or when the process of
answering the items is exhausting, the assumption of local independence between the items
may be violated. Four datasets were constructed with violations of the local independence
assumption. These response vectors were simulated according to a model proposed by
Jannarone (1986). Appendix B describes the model in detail. Using this model, the probability
of correctly answering an item is now determined by the item parameters a and b, the person
parameter θ, and the association parameter δ. When δ = 0 the model equals the 2PLM.
Compared to the 2PLM, positive values of δ result in a higher probability of a correct response,
and negative values of δ result in a lower probability of correctly answering an item. The values
δ = -2, -1, 1, and 2 were used to simulate these nonfitting response patterns.

The detection rate of a statistic is defined here as the proportion of detected nonfitting
response patterns. A response vector was classified as nonfitting the 2PLM when the observed
value of l_z or l*_z was below the critical value at level α = 0.05 of the standard normal distribution,
that is -1.65.
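
In code, this decision rule amounts to a single comparison. A minimal sketch, with a hypothetical function name and the critical value of -1.65 as the default:

```python
import numpy as np


def detection_rate(stat_values, critical_value=-1.65):
    """Proportion of response patterns whose fit statistic is below the cutoff."""
    return float(np.mean(np.asarray(stat_values, dtype=float) < critical_value))
```

The same function applies to l_z and l*_z; only the input values differ.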

For every dataset, the mean absolute bias, MAB, was determined and the mean bias
was calculated as $\mathrm{MB} = \frac{1}{N} \sum_{j=1}^{N} (\hat{\theta}_j - \theta_j)$, with the sum taken across all N
simulees. These variables were determined to investigate the trade-off between bias and detection rate.



In Table 6 the detection rates of l_z and l*_z are given for three types of aberrant response
behavior, for conventional tests of 10, 20, and 50 items. Table 6 shows that for all types of
aberrance and all tests the detection rate of l*_z was slightly higher than the detection rate of
Table 6. Detection rates for several types of aberrant response behavior, for P&P-tests of length
10, 20 and 50, using l_z < -1.65 and l*_z < -1.65.

                      k = 10                  k = 20                  k = 50
               MAB    l_z   l*_z       MAB    l_z   l*_z       MAB    l_z   l*_z
guessing      2.27   0.12   0.27      2.25   0.45   0.65      2.04   0.85   0.96
ρ = 0.6       0.74   0.05   0.06      0.60   0.04   0.07      0.42   0.06   0.07
ρ = 0.8       0.83   0.03   0.07      0.53   0.07   0.08      0.38   0.04   0.06
δ = -2.0      0.91   0.05   0.07      0.83   0.07   0.09      0.71   0.18   0.20
δ = -1.0      0.69   0.03   0.05      0.55   0.06   0.09      0.46   0.04   0.06
δ =  1.0      1.09   0.01   0.04      0.79   0.03   0.05      0.58   0.04   0.06
δ =  2.0      1.53   0.02   0.03      1.41   0.05   0.09      1.23   0.06   0.07
l_z. For example, for simulees guessing on all items of a conventional test of length 20, the
detection rates for l*_z and l_z were 0.65 and 0.45, respectively. Table 6 also shows that for
guessing the detection rates were reasonably high, whereas for violations of local independence
and unidimensionality of θ, the detection rates for both l_z and l*_z were low for all test lengths.
For example, for 50 items and guessing the detection rates were 0.85 and 0.96 for l_z and l*_z,
respectively, whereas for violation of unidimensionality and ρ = 0.6 the detection rates were
0.06 and 0.07 for l_z and l*_z, respectively. Table 6 also shows that the relation between MAB and
detection rate was unclear. For example, for guessing on all items of the 50-item test, the MAB
was high (2.04) and the detection rates were also high (0.85 and 0.96 for l_z and l*_z, respectively).
However, for violation of local independence with δ = 2.0 and 10 items, the MAB was rather
high (1.53) but the detection rates were low (0.02 and 0.03 for l_z and l*_z, respectively).

Discussion 

To detect examinees with inappropriate test scores, the use of person-fit statistics
was investigated in this study. In particular, the use of theoretical and simulated
distributions, and the use of θ and θ̂, in a conventional and CAT environment were explored. In Study
1 the empirical distributions of l_z and l*_z were compared with the theoretical distribution (i.e.,
standard normal) for conventional and adaptive tests. Results showed that, for conventional
tests, the distribution of l_z differed across θ-levels. However, for θ-values between -1 and 1
and long tests (50-80 items), the critical values at α = 0.05 of the simulated distribution
(using θ̂ to determine the values of l_z) were close to the expected -1.65 (see Reise, 1995, for
similar findings in the context of personality assessment). The critical values at α = 0.05 of the
empirical distribution of l*_z for conventional tests and for all θ-values were found to be close to
-1.65, as expected under the standard normal distribution. With respect to CAT, results showed
that the distribution of both l_z and l*_z differed across θ-levels and that the critical values of the
theoretical distribution differed substantially from the critical values of the empirical distributions
using θ̂.

In Study 2, simulating the distributions of $l_0$, $l_z$, and $l_z^*$ to create an approximation of the
empirical distribution for conventional and adaptive tests was investigated. In a conventional
testing context, especially for large positive and large negative $\theta$-values, simulating a sampling
distribution of $l_z$ for every examinee based on $\hat{\theta}$ resulted in an appropriate approximation of the
distribution. With respect to CAT, simulating the distributions of $l_0$, $l_z$, and $l_z^*$ was problematic.
For all three statistics, the left tails of the simulated distribution were inaccurate; for example,
using a stochastic test design to simulate the distribution of $l_0$ for every simulee resulted in only
5.8% of the simulees attaining a value of $l_0$ in the left 10% area of the distribution.
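
The per-examinee simulation for a conventional (fixed-item) test can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the item parameters, the number of replications, and the function names are hypothetical, and $l_z$ is evaluated at the same value of theta that generated the data rather than being re-estimated for every simulated pattern.

```python
import numpy as np

def p_2plm(theta, a, b):
    """Probability of a correct response under the 2PLM."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def l_z(x, theta, a, b):
    """Standardized log-likelihood statistic l_z.

    x is one response vector (shape (n,)) or a stack of them (shape (m, n)).
    """
    p = p_2plm(theta, a, b)
    logit = np.log(p / (1.0 - p))
    l0_centered = np.sum((x - p) * logit, axis=-1)   # l_0 - E(l_0)
    var_l0 = np.sum(p * (1.0 - p) * logit ** 2)      # var(l_0)
    return l0_centered / np.sqrt(var_l0)

def simulated_critical_value(theta_hat, a, b, alpha=0.05, n_rep=10000, seed=0):
    """Lower-tail critical value of l_z from model-fitting simulated patterns."""
    rng = np.random.default_rng(seed)
    p = p_2plm(theta_hat, a, b)
    # Each row is one simulated item score pattern on the fixed set of items.
    x = (rng.random((n_rep, a.size)) < p).astype(float)
    return np.quantile(l_z(x, theta_hat, a, b), alpha)

# Hypothetical 50-item test; compare the result with the normal-theory -1.65.
a = np.ones(50)
b = np.linspace(-2.0, 2.0, 50)
print(simulated_critical_value(0.0, a, b))
```

An observed pattern would then be classified as nonfitting when its $l_z$ value falls below the simulated critical value for that examinee.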

In Study 3 the detection rates of $l_z$ and $l_z^*$ for nonfitting response patterns were
investigated in a conventional testing context. Results showed that $l_z^*$ performed slightly better
than $l_z$ for short tests. For long tests the differences in detection rates between $l_z$ and $l_z^*$
were smaller because for long tests the critical values of the empirical distribution of $l_z$ were
reasonably in agreement with the critical values of the standard normal distribution (see also
Study 1).

A possible solution for the problems in simulating the sampling distribution may be
to use Bayesian methods to 'lift up' the tails of the distribution. Other alternatives that may be
considered in future research are using less biased estimators of $\theta$, or using statistics that are
less dependent on $\theta$.
Appendix A. Derivation of $l_z^*$

Snijders (1998) derived the asymptotic distribution of statistics that are linear in the
item responses and in which $\theta$ is replaced by an estimate $\hat{\theta}$. A statistic is linear in the item
responses when it can be written as

$$\sum_{i=1}^{n} X_i w_i(\theta) - w_0(\theta), \tag{13}$$

where the $w_i(\theta)$ are suitable functions. Snijders used in his paper the centered form

$$W_n(\theta) = \sum_{i=1}^{n} \left(X_i - P_i(\theta)\right) w_i(\theta). \tag{14}$$

For example, $W_n = l_0 - E(l_0)$ and $w_i(\theta) = \ln\left[P_i(\theta) / \left(1 - P_i(\theta)\right)\right]$ results in the centered
version of $l_0$. The only restriction on the estimator $\hat{\theta}$ is that it satisfies an equation of the form

$$r_0(\hat{\theta}) + \sum_{i=1}^{n} \left(X_i - P_i(\hat{\theta})\right) r_i(\hat{\theta}) = 0. \tag{15}$$

For example, for the maximum likelihood estimator and the 2PLM, $r_0(\hat{\theta}) = 0$ and $r_i(\hat{\theta}) = a_i$
for $i = 1, \ldots, n$.
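
The statement that the centered version of $l_0$ has the form of Equation 14, with $w_i(\theta)$ equal to the log-odds of a correct response, can be checked directly. The short derivation below assumes that $l_0$ is the usual log-likelihood of the item score pattern, a definition that is not restated in this appendix.

```latex
\begin{align*}
l_0 &= \sum_{i=1}^{n} \left[ X_i \ln P_i(\theta) + (1 - X_i) \ln\left(1 - P_i(\theta)\right) \right], \\
E(l_0) &= \sum_{i=1}^{n} \left[ P_i(\theta) \ln P_i(\theta) + \left(1 - P_i(\theta)\right) \ln\left(1 - P_i(\theta)\right) \right], \\
l_0 - E(l_0) &= \sum_{i=1}^{n} \left(X_i - P_i(\theta)\right) \ln \frac{P_i(\theta)}{1 - P_i(\theta)}.
\end{align*}
```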

The estimate satisfying 
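is the Warm estimator. Therefore, after some algebra, for the 2PLM, $r_i(\theta) = a_i$ and

$$r_0(\theta) = \frac{\sum_{i=1}^{n} a_i^3 P_i(\theta) \left[1 - P_i(\theta)\right] \left[1 - 2 P_i(\theta)\right]}{2 \sum_{i=1}^{n} a_i^2 P_i(\theta) \left[1 - P_i(\theta)\right]}. \tag{20}$$

Snijders showed that the expected value of $W_n(\hat{\theta})$ can be approximated by

$$E\left(W_n(\hat{\theta})\right) \approx -c_n(\theta)\, r_0(\theta) \tag{21}$$

and the variance by

$$\mathrm{var}\left(W_n(\hat{\theta})\right) \approx n \tau_n^2(\theta), \tag{22}$$

where

$$\tau_n^2(\theta) = \frac{1}{n} \sum_{i=1}^{n} \tilde{w}_i^2(\theta) P_i(\theta) \left[1 - P_i(\theta)\right], \tag{23}$$

$$\tilde{w}_i(\theta) = w_i(\theta) - c_n(\theta)\, r_i(\theta), \tag{24}$$

and

$$c_n(\theta) = \frac{\sum_{i=1}^{n} P_i'(\theta)\, w_i(\theta)}{\sum_{i=1}^{n} P_i'(\theta)\, r_i(\theta)}. \tag{25}$$

He also showed that the asymptotic distribution of

$$\frac{W_n(\hat{\theta}) + c_n(\hat{\theta})\, r_0(\hat{\theta})}{\sqrt{n}\, \tau_n(\hat{\theta})} \tag{26}$$

is standard normal. Note that the value of $c_n(\hat{\theta})\, r_0(\hat{\theta})$ does not depend directly on the pattern
of item responses, but only on $\hat{\theta}$.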
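
The correction described in this appendix can be sketched numerically for the 2PLM. The function below is an illustrative sketch rather than the authors' implementation: its name and arguments are hypothetical, it takes $w_i$ to be the log-odds (the choice that gives the centered $l_0$), and it assumes the caller supplies a theta estimate that actually satisfies the estimating equation (maximum likelihood when `warm=False`, the Warm estimator when `warm=True`).

```python
import numpy as np

def lz_and_lz_star(x, theta_hat, a, b, warm=False):
    """Return (l_z, corrected statistic) for one item score pattern under the 2PLM."""
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    q = 1.0 - p
    w = np.log(p / q)          # w_i: log-odds, giving the centered l_0
    r = a                      # r_i = a_i for the 2PLM
    dp = a * p * q             # P_i'(theta), derivative of the 2PLM curve
    if warm:
        # r_0 for the Warm estimator
        r0 = np.sum(a ** 3 * p * q * (1.0 - 2.0 * p)) / (2.0 * np.sum(a ** 2 * p * q))
    else:
        r0 = 0.0               # maximum likelihood estimator
    w_n = np.sum((x - p) * w)                 # W_n(theta_hat)
    c_n = np.sum(dp * w) / np.sum(dp * r)     # c_n(theta_hat)
    w_tilde = w - c_n * r                     # corrected weights
    n_tau2 = np.sum(w_tilde ** 2 * p * q)     # n * tau_n^2(theta_hat)
    lz = w_n / np.sqrt(np.sum(w ** 2 * p * q))
    lz_star = (w_n + c_n * r0) / np.sqrt(n_tau2)
    return lz, lz_star

# Hypothetical five-item example.
a = np.array([0.8, 1.0, 1.2, 1.5, 1.0])
b = np.array([-1.5, -0.5, 0.0, 0.7, 1.8])
x = np.array([1, 1, 0, 1, 0])
print(lz_and_lz_star(x, 0.4, a, b))
```

Because the corrected variance is never larger than the uncorrected one, the corrected statistic under maximum likelihood is at least as large in absolute value as $l_z$ for the same pattern, which is consistent with the higher detection rates reported for $l_z^*$ on short tests.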
Appendix B. Modeling Local Dependence

Let $x_i$, a realization of $X_i$, be the response to item $i$. One of the models Jannarone
(1986) presented was a conjunctive Rasch model; local dependence between two subsequent
items can be modeled by

$$P(X_i = x_i, X_{i+1} = x_{i+1} \mid \theta) = \frac{\exp\left[\sum_{j=i}^{i+1} x_j (\theta - b_j) + x_i x_{i+1} (\theta - \xi_{i,i+1})\right]}{1 + \exp[\theta - b_i] + \exp[\theta - b_{i+1}] + \exp\left[\sum_{j=i}^{i+1} (\theta - b_j) + (\theta - \xi_{i,i+1})\right]}, \tag{27}$$

where $\xi_{i,i+1}$ is a parameter modeling association between items $i$ and $i + 1$. This model can be
generalized to a conjunctive 2PLM, which can be written as

$$P(X_i = x_i, X_{i+1} = x_{i+1} \mid \theta) \propto \exp\left[\sum_{j=i}^{i+1} x_j a_j (\theta - b_j) + x_i x_{i+1} \gamma_{i,i+1} (\theta - \xi_{i,i+1})\right], \tag{28}$$

where $\gamma_{i,i+1}$ and $\xi_{i,i+1}$ are parameters modeling association between items.

In this study, the following model was used to simulate response vectors with local
dependence between all subsequent items:

$$P(X_i = x_i, X_{i+1} = x_{i+1} \mid \theta) \propto \exp\left[\sum_{j=i}^{i+1} x_j a_j (\theta - b_j) + x_i x_{i+1} \delta_{i,i+1}\right], \tag{29}$$

where $\delta_{i,i+1}$ is a parameter modeling association between items. The four possible realizations
of $(X_i, X_{i+1})$ have the following probabilities:

$$P(X_i = 0, X_{i+1} = 0) \propto 1,$$

$$P(X_i = 1, X_{i+1} = 0) \propto \exp\left[a_i (\theta - b_i)\right],$$

$$P(X_i = 0, X_{i+1} = 1) \propto \exp\left[a_{i+1} (\theta - b_{i+1})\right], \text{ and}$$

$$P(X_i = 1, X_{i+1} = 1) \propto \exp\left[a_i (\theta - b_i) + a_{i+1} (\theta - b_{i+1}) + \delta_{i,i+1}\right].$$



The conditional probability of a correct response to item $i + 1$ given a correct response to item
$i$ can be written as

$$P(X_{i+1} = 1 \mid X_i = 1, \theta) = \frac{P(X_i = 1, X_{i+1} = 1 \mid \theta)}{P(X_i = 1, X_{i+1} = 0 \mid \theta) + P(X_i = 1, X_{i+1} = 1 \mid \theta)} = \frac{\exp\left[a_i (\theta - b_i) + a_{i+1} (\theta - b_{i+1}) + \delta_{i,i+1}\right]}{\exp\left[a_i (\theta - b_i)\right] + \exp\left[a_i (\theta - b_i) + a_{i+1} (\theta - b_{i+1}) + \delta_{i,i+1}\right]}, \tag{30}$$

and the probability of a correct response to item $i + 1$ given an incorrect response to the previous
item can be written as

$$P(X_{i+1} = 1 \mid X_i = 0, \theta) = \frac{\exp\left[a_{i+1} (\theta - b_{i+1})\right]}{1 + \exp\left[a_{i+1} (\theta - b_{i+1})\right]}, \tag{31}$$

which is the 2PLM. The conditional probabilities in Equation 30 and the 2PLM were used to
simulate the responses to the items when the items are locally stochastically dependent.
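
A minimal sketch of how such responses might be generated is given below. It rests on assumptions that the appendix does not state: the first item is scored with the plain 2PLM, a single value of delta is used for every pair of subsequent items, and all function names and parameter values are hypothetical. Dividing the numerator and denominator of Equation 30 by $\exp[a_i(\theta - b_i)]$ shows that a correct previous response simply adds delta to the logit of the next item, which is what the code uses.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_next_correct(theta, a_next, b_next, delta, prev_correct):
    """P(X_{i+1} = 1 | X_i, theta): Equation 30 if X_i = 1, the 2PLM if X_i = 0."""
    z = a_next * (theta - b_next)
    if prev_correct:
        z += delta             # simplified form of Equation 30
    return logistic(z)

def simulate_locally_dependent(theta, a, b, delta, rng):
    """Simulate one item score pattern with dependence between subsequent items."""
    x = np.zeros(a.size, dtype=int)
    # Assumption: the first item has no predecessor and follows the 2PLM.
    x[0] = int(rng.random() < logistic(a[0] * (theta - b[0])))
    for i in range(1, a.size):
        p = p_next_correct(theta, a[i], b[i], delta, x[i - 1] == 1)
        x[i] = int(rng.random() < p)
    return x

# Hypothetical 10-item test with delta = 2.0.
rng = np.random.default_rng(0)
a = np.ones(10)
b = np.linspace(-1.0, 1.0, 10)
print(simulate_locally_dependent(0.0, a, b, 2.0, rng))
```

With delta equal to zero the procedure reduces to independent 2PLM responses, so the same function can also generate model-fitting patterns for comparison.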